Method and apparatus for adapting a default encoding of a digital video signal during a scene change period

ABSTRACT

The frame following a scene cut is usually coded as an I picture. In CBR encoding, the encoder will try to keep the bit rate constant, which will often cause serious picture quality degradation at scene changes. In VBR encoding, more bits will be allocated to the first frame of the new scene and the bit rate will increase significantly for a short time. Therefore subsequent frames must be coded in ‘skipped’ mode, which will often cause jerk artifacts. According to the invention, in each frame belonging to a scene change period, areas are determined that have different human attention levels. In the frames (n−1, n−2, n−3) located prior to the first new scene frame, to the areas having a lower attention level less bits are assigned than in the default encoding, and in the frames (n, n+1, n+2) located at and after the scene cut the thus saved bits are additionally assigned to the areas having a higher attention level.

The invention relates to a method and to an apparatus for adapting a default encoding of a digital video signal during a scene change period.

BACKGROUND

Hybrid video coding techniques have been widely adopted in video coding standards like H.263, MPEG-2 and H.264/MPEG-4 AVC. Intensive work has been carried out on improving the visual quality within a given bit rate constraint, using the existing coding tools. Generally CBR (constant bit rate control) and VBR (variable bit rate control) are used to meet the trade-off between quality and rate constraint for different applications. In CBR mode, the number of bits that can be transmitted to a video decoder in a given time interval is typically fixed. The decoder side will also use a buffer of specified size referred to as the video buffer verifier (VBV) in MPEG2 and MPEG4-2 or as Hypothetical Reference Decoder (HRD) in H.263 and MPEG4-AVC/H.264. Related applications are e.g. TV broadcast, cable transmission, and wireless communication of compressed video. In VBR mode, the total number of bits used to compress a long sequence of video is typically fixed, while limits on instantaneous bit rate are practically non-existent. Related applications are stored media applications like DVD (Digital Versatile Discs) and PVR (Personal Video Recorder).

Due to the high variability in the picture content present in many video sources, a long video sequence can be divided into consecutive video shots. A video shot may be defined as a sequence of frames captured by “a single camera in a single continuous action in time and space”. Usually it is a group of frames that have consistent visual characteristics (including colour, texture, motion, etc.). Therefore a large number of different types of scene changes or scene cuts can exist between such shots. A scene cut is an abrupt transition between two adjacent frames. Electronic scene cut detection as such is known. A common method is to use histograms for comparing consecutive video frames, cf. H. J. Zhang, A. Kankanhalli and S. W. Smoliar, “Automatic partitioning of full-motion video”, Multimedia Systems, volume 1, pages 10-28, 1993, Springer Verlag. In Z. Cernekova, I. Pitas, Ch. Nikou, “Information Theory-Based Shot Cut/Fade Detection and Video Summarization”, IEEE CSVT, pages 82-91, 2006, a Mutual Information (MI) is used for detecting scene cuts.
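The histogram comparison mentioned above can be illustrated by the following minimal sketch. It is not the processing of the cited articles: the function names, the 16-bin luminance histogram, the half-L1 distance and the threshold of 0.5 are assumptions chosen for illustration only, and each frame is assumed to be a non-empty flat list of 8-bit luminance samples.

```python
from typing import List, Sequence


def gray_histogram(frame: Sequence[int], bins: int = 16) -> List[float]:
    """Normalised histogram of 8-bit luminance samples (assumed frame layout)."""
    counts = [0] * bins
    for value in frame:
        counts[min(value * bins // 256, bins - 1)] += 1
    total = len(frame)
    return [c / total for c in counts]


def histogram_distance(h1: Sequence[float], h2: Sequence[float]) -> float:
    """Half the L1 distance between two normalised histograms, in [0, 1]."""
    return sum(abs(a - b) for a, b in zip(h1, h2)) / 2.0


def detect_scene_cuts(frames: Sequence[Sequence[int]],
                      threshold: float = 0.5,
                      bins: int = 16) -> List[int]:
    """Return indices of frames whose histogram differs strongly from the
    histogram of the preceding frame (hypothetical threshold)."""
    cuts = []
    previous = None
    for index, frame in enumerate(frames):
        current = gray_histogram(frame, bins)
        if previous is not None and histogram_distance(previous, current) > threshold:
            cuts.append(index)  # first frame of the new scene
        previous = current
    return cuts


# Two dark frames followed by two bright frames: the cut is at index 2.
example_frames = [[10] * 64, [12] * 64, [200] * 64, [205] * 64]
example_cuts = detect_scene_cuts(example_frames)
```

In practice the threshold would have to be tuned to the content, and gradual transitions such as fades require a different criterion than a single frame-to-frame comparison.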

FIG. 1 shows that the picture ‘n’ following a scene cut SCC is usually coded as an intra frame picture I, and different bit allocation schemes will be adopted in CBR and VBR processing. In case of CBR, the encoder will try to keep the bit rate R_(CBR) constant, as FIG. 1 illustrates, which will often cause serious picture quality degradation at scene changes. In case of VBR, more bits will be allocated to frame ‘n’ and the bit rate R_(VBR) will increase significantly for a short time. Usually, subsequent frames will be coded in ‘skipped’ mode according to buffer constraint or transmission rate constraint, i.e. R_(VBR) will be nearly zero as FIG. 1 illustrates in order to soon return to the average bit rate for the video sequence, which will often cause jerk artefacts in the video display. If an encoder does not handle a scene change well, the encoder will usually consider the picture following the scene cut (i.e. the first picture of the new scene) as a picture similar to the previous one and therefore allocate bits accordingly. If this picture is coded as a P or B frame, the picture quality will be seriously deteriorated because too few bits are allocated.

Also, in most rate control algorithms the parameters from coding previous pictures are usually used as candidate parameters for coding future pictures, which is not appropriate when a scene change occurs. This also results in a quality break and the more accurate the bit rate control is, the more severe the problem is.

US-A-2005/0286629 proposes a method for scene cut coding using non-reference frames, wherein the scene cut frame and its neighbouring frames (before and after) are coded as non-reference frames (B frame type, as shown in FIG. 2) with increased quantisation parameters QP (i.e. with coarser quantisation) in order to reduce the bandwidth. But in this case, the coding efficiency of the first P frame following the scene cut is very low due to the long prediction distance, and a longer picture delay is required. A better performance in the B frame coding and a good trade-off between quality and rate constraint cannot be assured.

INVENTION

Therefore, prior art scene change processing does not provide a good trade-off between resulting picture quality and buffer capacity constraint.

A problem to be solved by the invention is to provide increased picture coding/decoding quality and improved bit rate control for the pictures located near scene changes. This problem is solved by the method disclosed in claim 1. An apparatus that utilises this method is disclosed in claim 2.

The invention is based on the theory of ‘change blindness’, according to which changes in picture content tend to go unnoticed when picture errors occur far from the viewer's focus of attention (cf. D. T. Levin, “Change blindness,” Trends in Cognitive Science, 1997, vol. 1, no. 7, pp. 261-267, and D. J. Simons and Ch. F. Chabris, “Gorillas in our midst: sustained inattentional blindness for dynamic events,” Perception, 1999, vol. 7, pp. 1059-1074, or D. J. Simons). A novel processing with optimum perceptual video coding based on attention area extraction is used in this invention, which processing smoothes the output of the video encoding/decoding at scene changes without causing serious reduction of the subjective video quality.

Based on such information from attention area extraction and the effects of temporal masking and change blindness, a perceptual bit allocation scheme for scene changes is disclosed. Following video input and attention area extraction, scene change period detection is performed in a pre-analysis step. Based on the information from the attention area or areas, optimised bit allocation is performed, thereby achieving a much better subjective picture quality under buffer size constraint, and a significantly better trade-off between spatial/temporal quality assurance and buffer size constraint.

Advantageously, the invention can also be applied in case of special scene changes like fade in/out or dissolve. The invention can be combined with different bit rate control schemes.

In principle, the inventive method is suited for adapting a default encoding of a digital video signal during a scene change period, which period includes at least one picture prior to the first picture of the new scene and at least one picture following said first picture of the new scene, wherein these pictures and/or macro pixel blocks or pixel blocks of these pictures can be encoded in a predicted or in a non-predicted mode and wherein to quantised coefficients a number of bits are assigned for the further encoding processing, said method including the steps:

determining a scene cut and the corresponding scene change period;

in each picture belonging to said scene change period, determining areas having at least two different human attention levels;

assigning in the picture or pictures located prior to said first new scene picture to the areas having an attention level corresponding to lower attention less bits than in said default encoding and assigning in the pictures located at and after said scene cut the thus saved bits additionally to the areas having an attention level corresponding to higher attention.

In principle the inventive apparatus is suited for adapting a default encoding of a digital video signal during a scene change period, which period includes at least one picture prior to the first picture of the new scene and at least one picture following said first picture of the new scene, wherein these pictures and/or macro pixel blocks or pixel blocks of these pictures can be encoded in a predicted or in a non-predicted mode and wherein to quantised coefficients a number of bits are assigned for the further encoding processing, said apparatus including:

means being adapted for determining a scene cut and the corresponding scene change period and for determining, in each picture belonging to said scene change period, areas having at least two different human attention levels;

means being adapted for assigning in the picture or pictures located prior to said first new scene picture to the areas having an attention level corresponding to lower attention less bits than in said default encoding and for assigning in the pictures located at and after said scene cut the thus saved bits additionally to the areas having an attention level corresponding to higher attention.

Advantageous additional embodiments of the invention are disclosed in the respective dependent claims.

DRAWINGS

Exemplary embodiments of the invention are described with reference to the accompanying drawings, which show in:

FIG. 1 typical bit rate characteristic at scene change in the CBR and VBR cases;

FIG. 2 known quantisation parameter control at scene change;

FIG. 3 block diagram of the inventive encoder;

FIG. 4 example attention mask;

FIG. 5 illustration of frames the encoding of which is affected at scene changes;

FIG. 6 flowchart of the inventive processing;

FIG. 7 example perceptual bit allocation for encoding an attention area and a non-attention area at a scene change in the CBR case, and corresponding buffer fullness level.

EXEMPLARY EMBODIMENTS

FIG. 4 shows an example attention mask for a picture, which attention mask divides the macroblocks of the picture into e.g. four levels of attention importance. These levels M_(i) (i=1 . . . 4) represent the visual importance of each corresponding set of macroblocks, wherein a larger M_(i) means that a larger coding/decoding distortion can be tolerated because it is a visually less important area. The process of determining the M_(i) levels and the corresponding macroblock areas is described in L. Itti, Ch. Koch and E. Niebur, “A Model of Saliency-Based Visual Attention for Rapid Scene Analysis”, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 20, No. 11, Nov. 1998, in application EP05300974.2 and in B. Breitmeyer, H. Ogmen, “Recent models and findings in visual backward masking: A comparison, review and update”, Perception & Psychophysics, 2000, 62(8), pp. 1572-1595.

FIG. 5 shows a ‘scene change period’ which represents a period of frames located close to (before and after) a scene cut. The corresponding window size L1+L2 for the scene change period is determined by temporal forward and backward masking effects described in US-A-2005/0286629 and in the above article from Breitmeyer et al.

Usually, L1=(1 . . . 4) and L2=(2 or 3), for example L1=2 and L2=3 were used in simulations. The scene change period detection can be performed using the same method as for scene cut detection.
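As an illustration of the window described above, the following sketch lists the frames of a scene change period together with their index i relative to the scene cut frame. The function name and the clipping at the sequence boundaries are assumptions; the inclusive range −L1 ≤ i ≤ L2 follows the index convention used further below for the bit allocation.

```python
from typing import List, Optional, Tuple


def scene_change_period(cut_index: int,
                        l1: int = 2,
                        l2: int = 3,
                        num_frames: Optional[int] = None) -> List[Tuple[int, int]]:
    """Return (absolute frame number, relative index i) pairs for the frames
    with -l1 <= i <= l2 around the scene cut frame (i = 0).

    The window is clipped to the start of the sequence and, if num_frames is
    given, to its end (hypothetical boundary handling).
    """
    first = max(cut_index - l1, 0)
    last = cut_index + l2
    if num_frames is not None:
        last = min(last, num_frames - 1)
    return [(n, n - cut_index) for n in range(first, last + 1)]


# With L1=2 and L2=3 and a cut at frame 10, frames 8..13 are affected.
period = scene_change_period(10)
```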

FIG. 3 shows a block diagram of a known hybrid encoder to which two additional functions or blocks are attached. A video input signal VIN is fed to an intra/inter switching block or stage I/P, to a motion estimator ME and to a pre-analysis block or stage PREA. The output signal from stage I/P passes through a transform/quantisation block or stage T/Q to a corresponding inverse transform/inverse quantisation block or stage T⁻¹/Q⁻¹ and to an entropy encoding block or stage ENTC which includes a buffer and e.g. Huffman encodes the transform coefficients TRC. The fullness of this buffer may control via a quantisation parameter QP the quantiser step size in T/Q and T⁻¹/Q⁻¹. For non-predicted frames or macroblocks or blocks, the output of stage T⁻¹/Q⁻¹ passes through a one-frame delay L⁻¹ and a set of frame buffers RFB to a second input of motion estimator ME. For predicted frames or macroblocks or blocks, the output of stage T⁻¹/Q⁻¹ passes through a motion compensator MC and delay L⁻¹ and frame buffers RFB to the second input of motion estimator ME. ME calculates macroblock or block motion vectors MV which are used in MC and also passed to entropy encoding stage ENTC that outputs the encoded video bitstream VSTR. VSTR also includes header and side information HSI. For non-predicted frames or macroblocks or blocks, stage I/P passes signal VIN to stage T/Q. For predicted frames or macroblocks or blocks, the motion compensated macroblocks or blocks from motion estimator ME are subtracted in stage I/P from signal VIN and the difference signal only is fed to stage T/Q.

The pre-analysis block or stage PREA determines scene cut frame positions and multi-level (at least two) attention areas in the corresponding adjacent frames. Based on the information items from stage PREA, a coder control block or stage CCTRL sends first control information C1 to stages T/Q and T⁻¹/Q⁻¹ to control or adapt the quantisation of each macroblock or block, and sends second control information C2 to stage I/P for selecting the suitable mode (I, P, B) for each frame and/or the suitable mode (I, P, B) for each macroblock or block in the scene change period L1+L2. Stage CCTRL may also receive the current candidate quantisation parameter QP from stage ENTC in order to adapt it to the scene change encoding.

For detecting scene cuts, stage PREA may use the processing disclosed in the above-mentioned articles from Zhang et al. and Cernekova et al.

For detecting attention areas and attention levels, stage PREA may use the processing disclosed in the above-mentioned article from Itti et al. and in EP05300974.2. Humans always pay more attention to some parts of the picture than to other parts. The ‘attention area’ is the perceptually sensitive area in a picture which catches more attention from a human.

Basically, as described in the above Itti et al. article, an attention map or saliency map is calculated by determining e.g. colour contrast, luminance contrast and object or border orientation contrast maps of a picture, followed by calculating centre-surround differences, normalisation and linear combination. In other known attention calculation methods, first a set of feature maps in grey level is extracted from the visual input of a given image. These features include luminance intensity, colour, orientation, and so on.

Thereafter in each feature map the most salient areas are picked out. Finally all feature maps are integrated, in a purely bottom-up manner, into a master ‘saliency map’, which is regarded as the attention information of a picture. Therefrom the attention mask is obtained for each picture, describing the different attention importance levels of different areas of a picture.
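The last step, deriving an attention mask from a saliency map, can be sketched as follows. The sketch does not compute the saliency map itself and is not the processing of the cited article or application: it assumes that one saliency value per macroblock is already available and quantises these values into levels with uniform thresholds, which is an illustrative choice. Level 1 marks the most important macroblocks and the highest level the least important ones, in line with the meaning of M_(i) given for FIG. 4.

```python
from typing import List, Sequence


def attention_mask(saliency: Sequence[Sequence[float]],
                   levels: int = 4) -> List[List[int]]:
    """Quantise per-macroblock saliency values into attention levels 1..levels.

    Level 1 is assigned to the most salient macroblocks, the highest level to
    the least salient ones. Uniform thresholds over the value range are an
    assumption made for this sketch.
    """
    flat = [s for row in saliency for s in row]
    lowest, highest = min(flat), max(flat)
    span = highest - lowest
    mask = []
    for row in saliency:
        mask_row = []
        for s in row:
            # A flat saliency map carries no ranking; treat it as fully salient.
            norm = (s - lowest) / span if span > 0 else 1.0
            mask_row.append(levels - min(int(norm * levels), levels - 1))
        mask.append(mask_row)
    return mask


# 2x2 macroblocks with increasing saliency map to levels 4, 3, 2 and 1.
example_mask = attention_mask([[0.0, 0.3], [0.6, 1.0]])
```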

In the inventive processing depicted in FIG. 6, the video input signal VIN is checked in step SCCPD for scene cuts or changes and for the corresponding period. If the result of this check is ‘no’, the frame type and the macroblock or block type are determined in step FRMBTSEL according to default schemes (using any existing rate control and switching processing), followed by a known default encoding DEFCOD. If the result of check SCCPD is ‘yes’, the attention areas and levels of the corresponding frames are extracted in step ATTAEXT based on the above-mentioned processing. Based on the determined attention areas and levels, in step FRLBALL a corresponding bit allocation is carried out for corresponding frames. This frame level bit allocation adjustment is only applied to the frames belonging to the scene cut period L1+L2. Thereafter an attention based macroblock or block level bit allocation is made in these frames in step MBLBALL, based on the attention areas in the frames.
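The branching of FIG. 6 as described above can be summarised in a small sketch that only returns the sequence of step labels applied to a frame. The function name and the representation of the scene change period as a set of frame numbers are assumptions; the steps themselves are not implemented here.

```python
from typing import List, Set


def processing_steps(frame_number: int, period_frames: Set[int]) -> List[str]:
    """Return the FIG. 6 step labels that apply to one frame.

    period_frames is the (assumed) set of frame numbers that step SCCPD has
    found to belong to a scene change period.
    """
    if frame_number not in period_frames:
        # Result of check SCCPD is 'no': default mode selection and encoding.
        return ["SCCPD", "FRMBTSEL", "DEFCOD"]
    # Result of check SCCPD is 'yes': attention based bit allocation.
    return ["SCCPD", "ATTAEXT", "FRLBALL", "MBLBALL"]


# Frames 8..13 form the scene change period, frame 5 lies outside of it.
steps_outside = processing_steps(5, set(range(8, 14)))
steps_inside = processing_steps(9, set(range(8, 14)))
```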

In FIG. 7 the area marked by diagonal lines is an attention area and the area marked by cross lines is a non-attention area. When a scene cut happens, the attention area has the highest priority to be allocated with more bits in order to keep a high subjective picture quality within the attention area, while the non-attention area has a lower priority to be allocated bits.

E.g. for the frames n−2 and n−1 located prior to the scene cut, bits are removed from the non-attention area to free buffer occupancy, which introduces negligible degradation on subjective quality due to the effect of temporal backward masking. E.g. for the frames n, n+1, n+2 and n+3 of the new scene, bits can also be removed from the non-attention area to free buffer occupancy, which introduces negligible degradation on subjective quality due to the effect of temporal forward masking.

In other words, ΔR_(i) is removed from the non-attention area in the i-th frame inside the scene cut period (−L1≤i≤L2, wherein i=0 represents the scene cut frame and i<0 represents the frames prior to the scene cut), as defined in the following equation:

$\Delta R_{i} = \left( \frac{K_{i}\rho}{1 - \rho + K_{i}\rho} - \rho \right) R_{F_{i}},$

wherein i is the running frame, ρ is the proportion of the attention area or areas inside the whole frame, R_(F_i) is the total bit rate for frame i, and K_(i) is a control factor (K_(i)>1) that leads to a better trade-off between attention area picture quality and non-attention area picture quality and that can also be changed adaptively in practical applications. The size of K_(i) depends on the relative complexity between the attention area or areas and the non-attention area or areas. E.g., if an attention area has a much higher complexity, K_(i) will be larger.
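The equation can be evaluated directly, as in the following sketch. The function name and the argument checks are assumptions. The term K_(i)ρ/(1−ρ+K_(i)ρ) can be read as the share of the frame bits that the attention area receives when it is weighted K_(i) times as strongly as the non-attention area, so that ΔR_(i) is the increase of this share over the plain area proportion ρ, multiplied by the frame bits.

```python
def removed_bits(frame_bits: float, rho: float, k: float) -> float:
    """Bits removed from the non-attention area of one frame:
    delta_R = (k * rho / (1 - rho + k * rho) - rho) * frame_bits.

    frame_bits: total bits (bit rate) R_F_i of the frame
    rho:        proportion of the attention area inside the frame, 0 < rho < 1
    k:          control factor K_i, greater than 1
    """
    if not 0.0 < rho < 1.0:
        raise ValueError("rho must lie strictly between 0 and 1")
    if k <= 1.0:
        raise ValueError("the control factor k must be greater than 1")
    weighted_share = k * rho / (1.0 - rho + k * rho)
    return (weighted_share - rho) * frame_bits


# Attention area covering a quarter of the frame, K=3, 100000 bits per frame:
# the weighted share is 0.75 / 1.5 = 0.5, so 0.25 * 100000 bits are removed.
delta = removed_bits(100_000, rho=0.25, k=3.0)
```

For a K close to 1 the removed amount approaches zero, and for a growing K it approaches (1−ρ)·R_(F_i), i.e. all bits of the non-attention area.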

For the frames i<0, the removed bits from the non-attention area are saved to free buffer occupancy while for the frames i≥0 the removed bits are re-allocated to the attention area to improve the perceptual video quality. The adjustment of the bit allocation for different parts of the frames inside the scene cut period efficiently improves the subjective picture quality under the same total bit allocation for these frames, efficiently reduces the buffer occupancy before the scene cut and reduces the probability of buffer overflow after the scene cut.
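The handling of the removed bits before and after the scene cut can be sketched per frame as follows. The sketch assumes that the default encoding distributes the frame bits in proportion to the area sizes (ρ for the attention area, 1−ρ for the non-attention area); this default split, the function name and the returned tuple are illustrative assumptions and not taken from the description.

```python
from typing import Tuple


def allocate_frame(i: int, frame_bits: float, rho: float,
                   k: float) -> Tuple[float, float, float]:
    """Return (attention_bits, non_attention_bits, saved_bits) for frame i.

    i < 0:  the removed bits are saved to free buffer occupancy.
    i >= 0: the removed bits are re-allocated to the attention area.
    Assumes a default split proportional to the area sizes.
    """
    delta = (k * rho / (1.0 - rho + k * rho) - rho) * frame_bits
    attention_bits = rho * frame_bits
    non_attention_bits = (1.0 - rho) * frame_bits - delta
    if i < 0:
        saved_bits = delta
    else:
        attention_bits += delta
        saved_bits = 0.0
    return attention_bits, non_attention_bits, saved_bits


# Same parameters before (i=-1) and at (i=0) the scene cut.
before_cut = allocate_frame(-1, 100_000, rho=0.25, k=3.0)
at_cut = allocate_frame(0, 100_000, rho=0.25, k=3.0)
```

With these numbers the frame before the cut spends 75000 bits and saves 25000 bits, while the scene cut frame keeps its total of 100000 bits but spends half of them on the attention area.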

The number of frames coded in ‘skipped’ mode can be reduced by spending only limited resources on lower-attention areas and thereby reducing the bits necessary for encoding scene cut frames. This improves decoder synchronisation and removes jerk artefacts. Due to the change blindness property of the human visual system, a better trade-off between spatial/temporal picture quality assurance and buffer rate constraint can be achieved.

The inventive processing can also be based on fields instead of frames, i.e. in a more general manner the invention can be applied to pictures of a digital video signal.

The invention is compatible with other rate control schemes and suitable for any existing video coding standards like MPEG-2 Video, MPEG-4 AVC/H.264, H.263, etc.

1-10. (canceled)
11. Method for adapting a default encoding of a digital video signal during a scene change period, which period includes at least one picture prior to the first picture of the new scene and at least one picture following said first picture of the new scene, wherein these pictures and/or macro pixel blocks or pixel blocks of these pictures can be encoded in a predicted or in a non-predicted mode and wherein to quantized coefficients a number of bits are assigned for the further encoding processing, said method comprising the steps: determining a scene cut and the corresponding scene change period; in each picture belonging to said scene change period, determining areas having at least two different human attention levels; assigning in the picture or pictures located prior to said first new scene picture to the areas having an attention level corresponding to lower attention less bits than in said default encoding and assigning in the pictures located at and after said scene cut the thus saved bits additionally to the areas having an attention level corresponding to higher attention.
12. Method according to claim 11, wherein said scene change period includes one, two, three or four pictures prior to the first picture of the new scene and one or two pictures following said first picture of the new scene.
13. Method according to claim 11, wherein said assigning also includes assigning a suitable encoding mode to each frame and/or to each macroblock or block.
14. Method according to claim 11, wherein said assigning of bits is also performed at picture level.
15. Method according to claim 11, wherein also in the pictures located at and after said scene cut less bits are assigned to the areas having an attention level corresponding to lower attention and the thus saved bits are additionally assigned in these pictures to the areas having an attention level corresponding to higher attention.
16. Method according to claim 15, wherein said assigning of less bits to lower attention areas is increased picture by picture until said first picture of the new scene and is decreased picture by picture in the pictures following said first picture of the new scene, and wherein said assigning of more bits to higher attention areas is decreased picture by picture starting from said first picture of the new scene.
17. Method according to claim 16, wherein said bit rate is decreased in said lower attention areas in the i-th picture by $\Delta R_{i} = \left( \frac{K_{i}\rho}{1 - \rho + K_{i}\rho} - \rho \right) R_{F_{i}},$ wherein i is the running picture inside said scene cut period and i=0 represents said first picture of the new scene and i<0 represents the pictures prior to said first picture, ρ is the proportion of the attention area or areas inside a whole picture, R_(F_i) is the total bit rate for frame i, and K_(i) is a control factor greater than ‘1’ that leads to a better trade-off between higher attention area picture quality and lower attention area picture quality.
18. Apparatus for adapting a default encoding of a digital video signal during a scene change period, which period includes at least one picture prior to the first picture of the new scene and at least one picture following said first picture of the new scene, wherein these pictures and/or macro pixel blocks or pixel blocks of these pictures can be encoded in a predicted or in a non-predicted mode and wherein to quantized coefficients a number of bits are assigned for the further encoding processing, said apparatus including: means being adapted for determining a scene cut and the corresponding scene change period, and for determining, in each picture belonging to said scene change period, areas having at least two different human attention levels; means being adapted for assigning in the picture or pictures located prior to said first new scene picture to the areas having an attention level corresponding to lower attention less bits than in said default encoding and for assigning in the pictures located at and after said scene cut the thus saved bits additionally to the areas having an attention level corresponding to higher attention.
19. Apparatus according to claim 18, wherein said scene change period includes one, two, three or four pictures prior to the first picture of the new scene and one or two pictures following said first picture of the new scene.
20. Apparatus according to claim 18, wherein said assigning means also assigns a suitable encoding mode to each frame and/or to each macroblock or block.
21. Apparatus according to claim 18, wherein said assigning of bits is also performed at picture level.
22. Apparatus according to claim 18, wherein also in the pictures located at and after said scene cut less bits are assigned to the areas having an attention level corresponding to lower attention and the thus saved bits are additionally assigned in these pictures to the areas having an attention level corresponding to higher attention.
23. Apparatus according to claim 22, wherein said assigning of less bits to lower attention areas is increased picture by picture until said first picture of the new scene and is decreased picture by picture in the pictures following said first picture of the new scene, and wherein said assigning of more bits to higher attention areas is decreased picture by picture starting from said first picture of the new scene.
24. Apparatus according to claim 22, wherein said bit rate is decreased in said lower attention areas in the i-th picture by $\Delta R_{i} = \left( \frac{K_{i}\rho}{1 - \rho + K_{i}\rho} - \rho \right) R_{F_{i}},$ wherein i is the running picture inside said scene cut period and i=0 represents said first picture of the new scene and i<0 represents the pictures prior to said first picture, ρ is the proportion of the attention area or areas inside a whole picture, R_(F_i) is the total bit rate for frame i, and K_(i) is a control factor greater than ‘1’ that leads to a better trade-off between higher attention area picture quality and lower attention area picture quality.
25. A digital video signal that has been encoded using the method according to claim 11.
26. Storage medium, for example an optical disc, that contains or stores, or has recorded on it, a digital video signal encoded according to the method of claim 11.